Log in to GitHub to determine your team number and members for Lab 3.
Every team member should now go to the course GitHub organization and
locate your Lab 3 repository, which should have the prefix
lab-03. Clone the repository by copying the URL in
GitHub under the SSH tab in the Code drop-down and creating
a New Project using version control in RStudio. If you have
trouble, see the first lab for step-by-step instructions or ask a
teammate for help. Do not edit the .Rmd file until
explicitly asked to do so in the instructions.
Monkeypox virus is endemic in central and west Africa, and in 2022 a significant global outbreak is occurring in non-endemic areas.
Our World in Data curates a
number of really interesting data resources. In this lab we use their
monkeypox data repository, which is updated daily. We start by
visualizing these data. (Note: I’ve hidden the code below, but you can
look at it in the class organization under
website/docs/slides/week-03/lab-03-prob-teams.Rmd . You can change the
variable plotdate in the code to look at total cases per
million population on another day, if desired, but this lab focuses on
August 22, 2022.)
For this lab, you might want to do some pencil and paper calculations and then turn them in. If you know LaTeX-style equation coding, you can use that in R Markdown. If not, you can include an external image (e.g., a picture from your phone of your paper) easily. For example, someone in my family will be getting the following holiday gift from Snorg Tees (good advice in general!). The code to include it is below, and make sure that you save/upload the picture file in the same folder as this Rmd file:
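As a minimal sketch of one way to include a picture from an R code chunk, the example below uses `knitr::include_graphics()`. The file name `my-work.jpg` is hypothetical (an assumption for illustration); swap in the name of your own picture, saved in the same folder as the .Rmd file.

```r
# Hypothetical file name (an assumption): replace with your own picture,
# saved in the same folder as this .Rmd file.
img_file <- "my-work.jpg"

if (file.exists(img_file)) {
  # Embeds the image in the knitted document
  knitr::include_graphics(img_file)
} else {
  # Helpful reminder if the picture has not been saved/uploaded yet
  message("Image not found: ", img_file)
}
```

If the picture comes out too large in the PDF, a chunk option such as `out.width="50%"` in the chunk header is one way to shrink it.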
Assign each team member a number 1 through 4 and write your number down on a piece of paper. This lab will walk you through the basics of team workflow step-by-step. If your team has just three members, use your favorite method (e.g., rock-paper-scissors) to randomly assign one member to be team member 4 as well.
Do the following exercises in order, following each step carefully.
Only one person at a time should type in the
.Rmd file and push updates.
The person working should share their screen, and the others should follow along.
Team member 1: Open
the lab3.Rmd file and change the author of the YAML header
to the following “Team Number: Member 1, Member 2, Member 3, Member 4”
with your team number (for example Team 3) and the first and last names
of all team members.
Team member 1: Run
the subset-data code chunk to subset the data to August 22,
select only location, new cases, and total cases (not standardized to
population), and print the first 6 rows and the last 6 rows. Share the
results with your team members. Then, answer the questions below. (For
the rest of the assignment, we will consider only cases through August
22.)
library(tidyverse)
plotdate = "2022-08-22"
#pick off date of interest
case_day <- case_series %>%
filter(date == plotdate) %>%
select(location,new_cases,total_cases)
head(case_day,6) #you'll want to modify this line!
## location new_cases total_cases
## 1 Argentina 0 72
## 2 Australia 0 89
## 3 Austria 0 218
## 4 Belgium 47 671
## 5 Bolivia 0 43
## 6 Brazil 109 3896
tail(case_day,6)
## location new_cases total_cases
## 48 Thailand 0 5
## 49 Turkey 0 5
## 50 Uruguay 1 3
## 51 United States 1308 15357
## 52 Venezuela 0 1
## 53 World 2063 44275
# code to add up the total cases in the dataset not attributed to "World"
library(janitor)
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
totals <- case_day %>%
filter(location != "World") %>%
adorn_totals("row")
totals
## location new_cases total_cases
## Argentina 0 72
## Australia 0 89
## Austria 0 218
## Belgium 47 671
## Bolivia 0 43
## Brazil 109 3896
## Canada 10 1179
## Switzerland 17 416
## Chile 63 270
## Colombia 109 273
## Cuba 0 1
## Cyprus 0 4
## Czechia 0 39
## Germany 29 3295
## Denmark 6 169
## Dominican Republic 0 7
## Ecuador 0 20
## Spain 0 6119
## Estonia 0 9
## France 0 2888
## United Kingdom 145 3346
## Ghana 0 47
## Greece 2 52
## Guatemala 0 4
## Guyana 1 1
## Honduras 0 3
## Croatia 0 22
## Hungary 1 63
## Ireland 0 126
## Israel 5 213
## Italy 0 689
## Jamaica 0 4
## Luxembourg 0 45
## Morocco 0 1
## Mexico 134 386
## Montenegro 1 2
## Netherlands 1 1092
## Norway 0 76
## Panama 0 7
## Peru 60 1188
## Poland 10 114
## Puerto Rico 2 77
## Portugal 0 810
## Romania 0 34
## Saudi Arabia 0 6
## Slovakia 2 12
## Sweden 0 141
## Thailand 0 5
## Turkey 0 5
## Uruguay 1 3
## United States 1308 15357
## Venezuela 0 1
## Total 2063 43610
Team member 1: When
you have finished, knit to PDF, then stage, commit, and push your
.Rmd and PDF to GitHub with an appropriate commit
message.
All other team
members: Once your team member has pushed the work, pull
to get the updated documents from GitHub. Click on the .Rmd
file and you should see the responses to the first two exercises. Knit
the file to update your own documents.
Team member 2: It’s your turn. Answer the question below.
options(scipen=999)
case_day %>%
filter(location != "World") %>%
#the mutate command creates a variable that gives the proportion of total cases from each location
mutate(percentcase=total_cases/sum(total_cases))
## location new_cases total_cases percentcase
## 1 Argentina 0 72 0.00165099748
## 2 Australia 0 89 0.00204081633
## 3 Austria 0 218 0.00499885347
## 4 Belgium 47 671 0.01538637927
## 5 Bolivia 0 43 0.00098601238
## 6 Brazil 109 3896 0.08933730796
## 7 Canada 10 1179 0.02703508370
## 8 Switzerland 17 416 0.00953909654
## 9 Chile 63 270 0.00619124054
## 10 Colombia 109 273 0.00626003210
## 11 Cuba 0 1 0.00002293052
## 12 Cyprus 0 4 0.00009172208
## 13 Czechia 0 39 0.00089429030
## 14 Germany 29 3295 0.07555606512
## 15 Denmark 6 169 0.00387525797
## 16 Dominican Republic 0 7 0.00016051364
## 17 Ecuador 0 20 0.00045861041
## 18 Spain 0 6119 0.14031185508
## 19 Estonia 0 9 0.00020637468
## 20 France 0 2888 0.06622334327
## 21 United Kingdom 145 3346 0.07672552167
## 22 Ghana 0 47 0.00107773446
## 23 Greece 2 52 0.00119238707
## 24 Guatemala 0 4 0.00009172208
## 25 Guyana 1 1 0.00002293052
## 26 Honduras 0 3 0.00006879156
## 27 Croatia 0 22 0.00050447145
## 28 Hungary 1 63 0.00144462279
## 29 Ireland 0 126 0.00288924559
## 30 Israel 5 213 0.00488420087
## 31 Italy 0 689 0.01579912864
## 32 Jamaica 0 4 0.00009172208
## 33 Luxembourg 0 45 0.00103187342
## 34 Morocco 0 1 0.00002293052
## 35 Mexico 134 386 0.00885118092
## 36 Montenegro 1 2 0.00004586104
## 37 Netherlands 1 1092 0.02504012841
## 38 Norway 0 76 0.00174271956
## 39 Panama 0 7 0.00016051364
## 40 Peru 60 1188 0.02724145838
## 41 Poland 10 114 0.00261407934
## 42 Puerto Rico 2 77 0.00176565008
## 43 Portugal 0 810 0.01857372162
## 44 Romania 0 34 0.00077963770
## 45 Saudi Arabia 0 6 0.00013758312
## 46 Slovakia 2 12 0.00027516625
## 47 Sweden 0 141 0.00323320339
## 48 Thailand 0 5 0.00011465260
## 49 Turkey 0 5 0.00011465260
## 50 Uruguay 1 3 0.00006879156
## 51 United States 1308 15357 0.35214400367
## 52 Venezuela 0 1 0.00002293052
Team member 2: Knit
to PDF, then stage, commit, and push your .Rmd and PDF to
GitHub with an appropriate commit message.
All other team
members: Once your team member has pushed the work, pull
to get the updated documents from GitHub. Click on the .Rmd
file and you should see the responses to the first three exercises. Knit
the file.
Team member 3: It’s your turn. Complete the exercise below.
The starter code rearranges the cases to facilitate the plotting,
creating a new dataset tidycase. Some plotting options are
supplied to get you started, but you’ll need to fill out the code to
create the plot!
# split total cases into old cases (reported before August 22) and new cases (reported on August 22)
# store in same dataset
# change name of variable new_cases to new
case_day <- case_day %>%
mutate(old=total_cases-new_cases, new=new_cases) %>%
select(-new_cases) #this drops the original new_cases variable
#this formats the data for better plotting
# we'll learn more about this in data wrangling notes
# we're making two observations per country - one for new and one for old cases
tidycase <- case_day %>%
filter(location != "World") %>% #drop entire world summaries
pivot_longer(cols=c("new","old"),
names_to = "type",
values_to = "count") %>%
select(location, type, count) %>%
mutate(type=as.factor(type))
#take a peek at new dataset
head(tidycase)
## # A tibble: 6 × 3
## location type count
## <chr> <fct> <dbl>
## 1 Argentina new 0
## 2 Argentina old 72
## 3 Australia new 0
## 4 Australia old 89
## 5 Austria new 0
## 6 Austria old 218
# now finally make the plot!
# this is commented out for now because it doesn't run
# until you make edits!
#tidycase %>%
# ggplot(aes(x = , y = , fill= )) +
# note the stat="identity" option is needed because our data
# have been summarized already (new and old cases have already
# been counted)
# geom_bar(stat="identity",position="fill") +
# element text makes the font size larger or smaller; play with this
#theme(axis.text = element_text(size = 4)) +
# labs()
Team member 3: Knit
to PDF, then stage, commit, and push your .Rmd and PDF to
GitHub with an appropriate commit message.
All other team
members: Once your team member has pushed the work, pull
to get the updated documents from GitHub. Click on the .Rmd
file and you should see the responses to the first four exercises. Knit
the file.
Team member 4: It’s your turn. Complete the exercise below.
case_day %>%
#these are the North American countries with cases up to August 22
filter(location %in% c("United States","Canada","Puerto Rico","Panama","Dominican Republic","Guatemala","Mexico")) %>%
# this code calculates each location's share (a proportion between 0 and 1) of the cases among the North American countries in the filter statement above
mutate(percent=total_cases/sum(total_cases))
## location total_cases old new percent
## 1 Canada 1179 1169 10 0.0692836575
## 2 Dominican Republic 7 7 0 0.0004113534
## 3 Guatemala 4 4 0 0.0002350591
## 4 Mexico 386 252 134 0.0226831992
## 5 Panama 7 7 0 0.0004113534
## 6 Puerto Rico 77 75 2 0.0045248869
## 7 United States 15357 14049 1308 0.9024504907
Team member 4: Knit
to PDF, then stage, commit, and push your .Rmd and PDF to
GitHub with an appropriate commit message.
All other team
members: Once your team member has pushed the work, pull
to get the updated documents from GitHub. Click on the .Rmd
file and you should see the responses to the exercises completed so far. Knit
the file.
Team member 1: It’s your turn again. Answer the question below with help from your team.
That is, we will get a very rough estimate of each country’s population by dividing its number of cases by its rate of cases per million population and then multiplying by one million.
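Written as a formula, the rough estimate described above looks like the first line below. The numbers in the second line are made up purely to illustrate the arithmetic and are not taken from the data.

\[
\text{approximate population} \approx 1{,}000{,}000 \times \frac{\text{total cases}}{\text{total cases per million}}
\]

\[
1{,}000{,}000 \times \frac{50}{2.5} = 20{,}000{,}000
\]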
Let \(A\) be the event a person in this set of data is a US resident, and let \(B\) be the event a person has monkeypox. If country of residence and infection status are independent in these North American countries, then \(P(A|B) = P(A)\). If this condition is satisfied for the US, then we’d want to check the condition for the other countries in North America to be sure country and infection are independent. If the condition is not satisfied for the US, then the two variables are not independent, and we don’t have to bother checking other countries.
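To make the independence check concrete, here is a minimal sketch on a toy example. The two regions, their populations, and their case counts are all made up (they are not the lab data), and the object names `toy`, `p_A`, and `p_A_given_B` are just illustrative choices.

```r
# Made-up numbers for two hypothetical regions (NOT the lab data)
toy <- data.frame(
  region = c("Region X", "Region Y"),
  pop    = c(3000000, 1000000),  # residents
  cases  = c(600, 100)           # infected residents
)

# Event A: a randomly chosen person lives in Region X
# Event B: a randomly chosen person is infected
p_A         <- toy$pop[1] / sum(toy$pop)      # P(A)
p_A_given_B <- toy$cases[1] / sum(toy$cases)  # P(A | B)

round(c(p_A = p_A, p_A_given_B = p_A_given_B), 3)

# Independence requires P(A | B) = P(A); compare with a numerical tolerance
independent <- isTRUE(all.equal(p_A, p_A_given_B))
independent
```

In this toy example, Region X holds 75% of the people but a larger share of the cases, so \(P(A|B) \neq P(A)\) and region and infection status are not independent. With real data the two probabilities will rarely match exactly, so your team will need to judge whether they are close enough.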
You may find the following code helpful!
# creates new dataset, data5, that contains total cases as of August 22, and population of each country, among the North American countries with cases
data5 <- case_series %>%
filter(date == plotdate) %>%
filter(location %in% c("United States","Canada","Puerto Rico","Panama","Dominican Republic","Guatemala","Mexico")) %>%
mutate(approxpop = 1000000*total_cases/total_cases_per_million) %>%
select(location,approxpop,total_cases)
data5
## location approxpop total_cases
## 1 Canada 38155340 1179
## 2 Dominican Republic 11111111 7
## 3 Guatemala 17621145 4
## 4 Mexico 126723572 386
## 5 Panama 4350528 7
## 6 Puerto Rico 3256089 77
## 7 United States 336998025 15357
Team member 1: When
you have finished, knit to PDF, then stage, commit, and push your
.Rmd and PDF to GitHub with an appropriate commit
message.
All other team
members: Once your team member has pushed the work, pull
to get the updated documents from GitHub. Click on the .Rmd
file and you should see the responses to the exercises completed so far. Knit
the file to update your own documents.
Team member 2: It’s your turn. Answer the question below.
case_series.
The variable new_cases shows the number of new cases
reported each day of the outbreak. Filter the data to include the
country of your choice and create a scatterplot showing the case trend
over time in this country. Describe this trend in a couple of
sentences.
Because the variable date in that dataset is stored as
a character rather than a date variable, we first need to reformat it
using the code below.
# change formatting of date from character to date format
case_series <- case_series %>% #make a change and save to dataset of same name
mutate(date=as.Date(date,'%Y-%m-%d'))
# ggplot tip: this code below angles the x axis tick labels
# helpful if the labels overlap each other and are hard to read
# + theme(axis.text.x=element_text(angle=60, hjust=1))
Team member 2: When
you have finished, knit to PDF, then stage, commit, and push your
.Rmd and PDF to GitHub with an appropriate commit
message.
All other team
members: Once your team member has pushed the work, pull
to get the updated documents from GitHub. Click on the .Rmd
file to see your final version of the lab.
Team member 3: Upload your team’s PDF to Gradescope. Include every team member’s name in the Gradescope submission and identify which problems are on each page in Gradescope. Associate the “Overall” section with the first page of your PDF.
There should only be one submission per team on Gradescope.
Total: 50 pts
Exercise 1: 7 pts
Exercise 2: 7 pts
Exercise 3: 7 pts
Exercise 4: 7 pts
Exercise 5: 7 pts
Exercise 6: 7 pts
Overall: 8 pts